Search CORE

59 research outputs found

Recommended from our members

Unifying regression testing with mutation testing

Author: Zhang Lingming
Publication venue
Publication date: 07/07/2014
Field of study

textSoftware testing is the most commonly used methodology for validating quality of software systems. Conceptually, testing is simple, but in practice, given the huge (practically infinite) space of inputs to test against, it requires solving a number of challenging problems, including evaluating and reusing tests efficiently and effectively as software evolves. While software testing research has seen much progress in recent years, many crucial bugs still evade state-of-the-art approaches and cause significant monetary losses and sometimes are responsible for loss of life. My thesis is that a unified, bi-dimensional, change-driven methodology can form the basis of novel techniques and tools that can make testing significantly more effective and efficient, and allow us to find more bugs at a reduced cost. We propose a novel unification of the following two dimensions of change: (1) real manual changes made by programmers, e.g., as commonly used to support more effective and efficient regression testing techniques; and (2) mechanically introduced changes to code or specifications, e.g., as originally conceived in mutation testing for evaluating quality of test suites. We believe such unification can lay the foundation of a scalable and highly effective methodology for testing and maintaining real software systems. The primary contribution of my thesis is two-fold. One, it introduces new techniques to address central problems in both regression testing (e.g., test prioritization) and mutation testing (e.g., selective mutation testing). Two, it introduces a new methodology that uses the foundations of regression testing to speed up mutation testing, and also uses the foundations of mutation testing to help with the fault localization problem raised in regression testing. The central ideas are embodied in a suite of prototype tools. Rigorous experimental evaluation is used to validate the efficacy of the proposed techniques using a variety of real-world Java programs.Electrical and Computer Engineerin

Texas ScholarWorks

Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair

Author: Wei Yuxiang
Xia Chunqiu Steven
Zhang Lingming
Publication venue
Publication date: 01/09/2023
Field of study

During Automated Program Repair (APR), it can be challenging to synthesize correct patches for real-world systems in general-purpose programming languages. Recent Large Language Models (LLMs) have been shown to be helpful "copilots" in assisting developers with various coding tasks, and have also been directly applied for patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantics constraints of the target programming language. This results in plenty of statically invalid generated patches, impeding the practicality of the technique. Therefore, we propose Repilot, a framework to further copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), resembling human writing programs, which can be significantly boosted and guided through a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely-used Defects4j 1.2 and 2.0 datasets shows that Repilot fixes 66 and 50 bugs, respectively, surpassing the best-performing baseline by 14 and 16 bugs fixed. More importantly, Repilot is capable of producing more valid and correct patches than the base LLM when given the same generation budget

arXiv.org e-Print Archive

NeuRI: Diversifying DNN Generation via Inductive Rule Inference

Author: Liu Jiawei
Peng Jinjun
Wang Yuyao
Zhang Lingming
Publication venue
Publication date: 04/09/2023
Field of study

Deep Learning (DL) is prevalently used in various industries to improve decision-making and automate processes, driven by the ever-evolving DL libraries and compilers. The correctness of DL systems is crucial for trust in DL applications. As such, the recent wave of research has been studying the automated synthesis of test-cases (i.e., DNN models and their inputs) for fuzzing DL systems. However, existing model generators only subsume a limited number of operators, lacking the ability to pervasively model operator constraints. To address this challenge, we propose NeuRI, a fully automated approach for generating valid and diverse DL models composed of hundreds of types of operators. NeuRI adopts a three-step process: (i) collecting valid and invalid API traces from various sources; (ii) applying inductive program synthesis over the traces to infer the constraints for constructing valid models; and (iii) using hybrid model generation which incorporates both symbolic and concrete operators. Our evaluation shows that NeuRI improves branch coverage of TensorFlow and PyTorch by 24% and 15% over the state-of-the-art model-level fuzzers. NeuRI finds 100 new bugs for PyTorch and TensorFlow in four months, with 81 already fixed or confirmed. Of these, 9 bugs are labelled as high priority or security vulnerability, constituting 10% of all high-priority bugs of the period. Open-source developers regard error-inducing tests reported by us as "high-quality" and "common in practice"

arXiv.org e-Print Archive

Fuzzing Deep-Learning Libraries via Automated Relational API Inference

Author: Deng Yinlin
Wei Anjiang
Yang Chenyuan
Zhang Lingming
Publication venue
Publication date: 12/07/2022
Field of study

A growing body of research has been dedicated to DL model testing. However, there is still limited work on testing DL libraries, which serve as the foundations for building, training, and running DL models. Prior work on fuzzing DL libraries can only generate tests for APIs which have been invoked by documentation examples, developer tests, or DL models, leaving a large number of APIs untested. In this paper, we propose DeepREL, the first approach to automatically inferring relational APIs for more effective DL library fuzzing. Our basic hypothesis is that for a DL library under test, there may exist a number of APIs sharing similar input parameters and outputs; in this way, we can easily "borrow" test inputs from invoked APIs to test other relational APIs. Furthermore, we formalize the notion of value equivalence and status equivalence for relational APIs to serve as the oracle for effective bug finding. We have implemented DeepREL as a fully automated end-to-end relational API inference and fuzzing technique for DL libraries, which 1) automatically infers potential API relations based on API syntactic or semantic information, 2) synthesizes concrete test programs for invoking relational APIs, 3) validates the inferred relational APIs via representative test inputs, and finally 4) performs fuzzing on the verified relational APIs to find potential inconsistencies. Our evaluation on two of the most popular DL libraries, PyTorch and TensorFlow, demonstrates that DeepREL can cover 157% more APIs than state-of-the-art FreeFuzz. To date, DeepREL has detected 162 bugs in total, with 106 already confirmed by the developers as previously unknown bugs. Surprisingly, DeepREL has detected 13.5% of the high-priority bugs for the entire PyTorch issue-tracking system in a three-month period. Also, besides the 162 code bugs, we have also detected 14 documentation bugs (all confirmed).Comment: Accepted at ESEC/FSE 202

arXiv.org e-Print Archive

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Author: Liu Jiawei
Wang Yuyao
Xia Chunqiu Steven
Zhang Lingming
Publication venue
Publication date: 02/05/2023
Field of study

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code according to user intent written in natural language. Code evaluation datasets, containing curated synthesis problems with input/output test-cases, are used to measure the performance of various LLMs on code synthesis. However, test-cases in these datasets can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis benchmarking framework to rigorously evaluate the functional correctness of LLM-synthesized code. In short, EvalPlus takes in the base evaluation dataset and uses an automatic input generation step to produce and diversify large amounts of new test inputs using both LLM-based and mutation-based input generators to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additionally generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by 15.1% on average! Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis but also opens up a new direction to improve programming benchmarks through automated test input generation

arXiv.org e-Print Archive